Part 1 (65 points)

Overview

Rules

  • You may work in groups
  • You can use any resources you want (books, notes, internet, friends, teachers)
  • You must place all of your answers in a Word document and make it clear which question you are answering at the time.
  • For each question in Part 1, copy and paste the code you used to solve the problem into your Word document. If the question asks you to provide additional information, include your answers along with your code in the Word document.
  • Each question in this portion of the midterm is worth 5 points

Background

During this midterm you’ll demonstrate your R proficiency while solving a modern data analysis problem using the skills you’ve acquired to this point. Most tasks will be broken down by week to make examples easier to find in your notes.

On to the project! Who funds presidential campaigns? Who has raised the most money for the 2016 campaign to date (as of 10/05/2015)? Has it impacted polling numbers? Let’s find out!

The Data

Your professor downloaded the Federal Election Commission’s 2016 Presidential Campaign Finance data set from http://www.fec.gov/disclosurep/PDownload.do on 10/05/2015.

Here’s the official description of the data. Please read through this and familiarize yourself with the variable descriptions.

You can download a clean version of the data set from CourseConnect.

Weeks 4 & 5

In this section, you’ll use your data import skills, R programming skills and string manipulation knowledge to read in our data set and prepare it for later analysis.

1)

Your first task is to read the data set into R. Download the data using the link above, and be sure to remember where you saved it. Luckily, this is a well-formed CSV file, so you should be able to read it in using the standard methods we learned about. Be sure to set the stringsAsFactors argument to FALSE when you read the data; if you do not, you will have a difficult time completing the midterm. Also, please name your imported data set dat.
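
As a minimal sketch of the import step, the example below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the object name toy are assumptions for illustration only; your real call will point at wherever you saved the FEC file and will assign the result to dat.

```r
# Toy example: the rows below are invented, not taken from the FEC data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("cand_nm,contb_receipt_amt",
             "\"Doe, Jane\",50",
             "\"Roe, Richard\",-25"), tmp)

# stringsAsFactors = FALSE keeps text columns as character vectors
toy <- read.csv(tmp, stringsAsFactors = FALSE)

str(toy)            # inspect the column types
class(toy$cand_nm)  # character, not factor
```

In R 4.0.0 and later, stringsAsFactors already defaults to FALSE, but setting it explicitly keeps the code working the same way on older installations.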

2)

Some of these campaign donations have been refunded. Use grep to count the number of times the word Refund occurs in the receipt_desc column.
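
Here is a small sketch of how grep can be used for counting, run on a made-up character vector (desc is a hypothetical stand-in, not the real receipt_desc column). Note that grep returns the positions of the elements that match, so the length of its result counts matching elements rather than every occurrence of the word.

```r
# Made-up descriptions for illustration only
desc <- c("Refund", "", "REFUND ISSUED", "See Refund memo", NA)

hits <- grep("Refund", desc)  # positions of elements containing "Refund"
hits
length(hits)                  # how many elements matched

# grep is case-sensitive by default; ignore.case = TRUE relaxes that
hits_any_case <- grep("refund", desc, ignore.case = TRUE)
length(hits_any_case)
```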

3)

Create a new column in the data frame that contains only the month information stored in the column contb_receipt_dt (i.e., extract JUN from 27-JUN-15, APR from 13-APR-15, etc.). Name your newly created column month_abrv.
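
The sketch below shows two string-manipulation approaches on a made-up vector of dates in the same DD-MON-YY layout. The vector name dates and both result names are assumptions for illustration; either approach could be adapted to a data frame column.

```r
# Made-up dates in the same layout as the examples above
dates <- c("27-JUN-15", "13-APR-15", "01-SEP-15")

# Approach 1: the month always occupies characters 4 through 6
by_position <- substr(dates, 4, 6)

# Approach 2: split on the dash and keep the second piece of each date
by_split <- sapply(strsplit(dates, split = "-"), "[", 2)

by_position
identical(by_position, by_split)
```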

4)

Create a new column in the data frame that includes only the standard 5 digit zip code. Start out by creating an empty column to store the new variable by running the following code:

dat$five_zip <- NA

Now, use a for loop to iterate over the rows of the data frame and populate five_zip according to the following rules:

  • If the row’s value for contbr_zip has 9 characters in it, set the value of five_zip to be the first 5 digits in the string. For example, the value of contbr_zip in the first row is 090960009, so we’d set the value of five_zip to be 09096.
  • If the row’s value for contbr_zip has any other number of characters, set the value of five_zip to be the same as the value of contbr_zip.

Warning: your for loop is going to take a while to run, possibly several minutes!
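
To make the two rules concrete, here is a minimal sketch of the loop on a short made-up vector rather than the full data frame. The names zips, five, and five_vec are hypothetical. The last two lines use a vectorized ifelse purely as a cross-check of the loop's result; the question itself still asks for a for loop.

```r
# Made-up zip codes: two 9-character values and two shorter ones
zips <- c("090960009", "58102", "581021234", "5810")
five <- rep(NA_character_, length(zips))

for (i in seq_along(zips)) {
  if (nchar(zips[i]) == 9) {
    five[i] <- substr(zips[i], 1, 5)  # keep the first 5 characters
  } else {
    five[i] <- zips[i]                # leave any other length unchanged
  }
}
five

# Cross-check: the same rules applied to the whole vector at once
five_vec <- ifelse(nchar(zips) == 9, substr(zips, 1, 5), zips)
identical(five, five_vec)
```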

5)

Which zip code has donated the most frequently? Which state is that zip code located in?
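
One way to find the most frequent value in a column is to tabulate it and sort the counts. The sketch below does this on a made-up data frame; the state column name contbr_st is an assumption here, so check the official variable descriptions for the real name.

```r
# Made-up data for illustration only
toy <- data.frame(
  five_zip  = c("58102", "58102", "10001", "58102", "10001", "90210"),
  contbr_st = c("ND", "ND", "NY", "ND", "NY", "CA"),
  stringsAsFactors = FALSE
)

# Count how often each zip code appears, largest count first
counts <- sort(table(toy$five_zip), decreasing = TRUE)
top_zip <- names(counts)[1]

# Look up the state(s) recorded alongside that zip code
top_state <- unique(toy$contbr_st[toy$five_zip == top_zip])

top_zip
top_state
```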

Week 1

During week 1 we learned how to install packages from remote repositories like CRAN. Here you’ll install a package to help parse some donation dates out of the FEC data set. You’ll also demonstrate your ability to learn about a new function in an R package and use it to solve a specific problem.

We have not yet discussed how R deals with variables that store information about dates and times. Part of the reason for this is the fact that dealing with date-time data in R can cause a lot of headaches. The lubridate package makes dealing with date-time data much easier.

6)

Install the lubridate package and skim through this introduction to get a feel for the package’s intended use. Copy and paste the code you used to install the package into your Word document.

7)

The contb_receipt_dt column in our data set stores the reported contribution receipt date for each gift. What is the data type of the contb_receipt_dt column?
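
As a reminder of how to inspect types, the sketch below applies class and typeof to a few made-up vectors (all object names are hypothetical). The same functions work on a single data frame column selected with $.

```r
# Made-up vectors of different types
words  <- c("a", "b")
counts <- 1:3
prices <- c(2.5, 10)
a_date <- as.Date("2015-06-27")

class(words)    # "character"
class(counts)   # "integer"
class(prices)   # "numeric"
class(a_date)   # "Date"

# typeof reports the underlying storage; a Date is stored as a number
typeof(a_date)  # "double"
```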

8)

The following lubridate functions allow you to convert date-time variables to R’s native date-time representation: ymd(), mdy(), and dmy(). Figure out which one of these functions you need to use to parse the data stored in contb_receipt_dt. Use the correct function to create a new column, date_parsed, that stores the parsed values.
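
The function names spell out the order in which the year, month, and day appear in the text being parsed. The sketch below illustrates this with three made-up strings that all describe the same day; none of them uses the layout found in contb_receipt_dt, so choosing the right function for our data is still up to you. It assumes lubridate is already installed.

```r
library(lubridate)

# The same made-up day written in three different orders
d1 <- ymd("2015-06-27")   # year, month, day
d2 <- mdy("06/27/2015")   # month, day, year
d3 <- dmy("27-06-2015")   # day, month, year

class(d1)                 # "Date"
d1 == d2 & d2 == d3       # all three parse to the same date
```

If the function does not match the order of the text, the parse generally fails and returns NA with a warning, which is a useful signal that you picked the wrong one.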

Week 2

Week 2 introduced us to R’s basic data types and data structures. Here you’ll demonstrate your knowledge of each of these.

9)

How many columns in dat are character? How many columns are numeric or integer?
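
A minimal sketch of counting columns by type, using a small made-up data frame (toy and its column names are hypothetical). sapply applies a function to every column, and summing a logical vector counts its TRUE values.

```r
# Made-up data frame with two character columns and two number columns
toy <- data.frame(
  name  = c("a", "b"),
  state = c("ND", "MN"),
  amt   = c(10.5, 20),
  n     = c(1L, 2L),
  stringsAsFactors = FALSE
)

col_classes <- sapply(toy, class)
table(col_classes)                       # one count per class

n_char <- sum(sapply(toy, is.character))
n_num  <- sum(sapply(toy, is.numeric))   # is.numeric is TRUE for integer too

n_char
n_num
```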

10)

Find the transaction with the smallest value of contb_receipt_amt in our data set. What is its amount? Which candidate is it associated with? Find the donation matching this refund by grep’ing for the donor’s last name in the data set. What is this donor’s contbr_occupation?
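
The sketch below walks through the same sequence of steps on made-up rows: locate the minimum with which.min, pull out a last name, and grep for it. The donor-name column contbr_nm and its "LAST, FIRST" layout are assumptions here; confirm both against the official variable descriptions.

```r
# Made-up transactions: a donation and its matching refund, plus one other
toy <- data.frame(
  cand_nm           = c("Doe, Jane", "Doe, Jane", "Roe, Richard"),
  contbr_nm         = c("SMITH, PAT", "SMITH, PAT", "JONES, LEE"),
  contb_receipt_amt = c(2700, -2700, 50),
  stringsAsFactors  = FALSE
)

i <- which.min(toy$contb_receipt_amt)  # row index of the smallest amount
toy[i, ]

# Last name is the text before the comma in the donor's name
last <- strsplit(toy$contbr_nm[i], split = ",")[[1]][1]

# All rows whose donor name contains that last name
matches <- grep(last, toy$contbr_nm)
toy[matches, ]
```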

Week 4

Week 4 taught us about the ggplot2 graphics system. For each of the questions in this section, please export a copy of the graph you generate and include it in the Word document as well.

11)

Here we’re going to examine the total dollar amount of gifts given to each candidate on each day.

Run the following code that uses dplyr to compute the total amount of donations received by each candidate on each date in the data set.

library(dplyr)
cand_date <- dat %>% 
  group_by(cand_nm, date_parsed) %>% 
  summarise(total_gifts = sum(contb_receipt_amt))

Now, use ggplot to plot the cand_date data set using a line graph (geom_line). Place date_parsed on the x-axis and total_gifts on the y-axis. Create an aesthetic mapping for color that creates individual uniquely-colored lines for each candidate.
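
For reference, here is a minimal sketch of a multi-line plot on a made-up data frame (toy, cand, day, and total are hypothetical names, not the columns of cand_date). Mapping a grouping variable to color inside aes is what produces one line per group.

```r
library(ggplot2)

# Made-up totals for two candidates over three days
toy <- data.frame(
  cand  = rep(c("A", "B"), each = 3),
  day   = rep(as.Date("2015-07-01") + 0:2, times = 2),
  total = c(100, 150, 120, 80, 60, 200)
)

# color = cand draws a separately colored line for each candidate
p <- ggplot(toy, aes(x = day, y = total, color = cand)) +
  geom_line()
p
```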

12)

Make some boxplots to see how the distribution of contb_receipt_amt changes over time. Use the month_abrv variable you created in question 3) as your x-variable to show how the distribution of contb_receipt_amt changes from month-to-month.

Extra credit if you can figure out how to get the months ordered correctly (hint).
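
One common way to control the order of categories on an axis is to store them as a factor with explicitly ordered levels. The sketch below shows the idea on a made-up vector; it is one possible approach under that assumption, not necessarily the one the hint points to.

```r
# Made-up month abbreviations in no particular order
m <- c("JUN", "APR", "SEP", "APR", "JAN")

sort(unique(m))  # plain character data sorts alphabetically

# month.abb is built into R ("Jan", "Feb", ...); toupper matches our casing
m_f <- factor(m, levels = toupper(month.abb))

# Levels that actually occur, now in calendar order
used <- levels(droplevels(m_f))
used
```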

13)

Make some boxplots to see how the distribution of contb_receipt_amt differs between candidates. Use cand_nm as the x-variable along with geom_boxplot. Write 2 or 3 sentences describing any trends or oddities you notice.
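
A minimal boxplot sketch on simulated data follows; the data frame toy and its columns group and amt are made up for illustration. A categorical variable on the x-axis gives one box per category.

```r
library(ggplot2)

# Simulated amounts for three made-up groups
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  amt   = c(rnorm(20, 50, 10), rnorm(20, 100, 30), rnorm(20, 75, 5))
)

p <- ggplot(toy, aes(x = group, y = amt)) +
  geom_boxplot()
p

# Long category labels can overlap; flipping the axes is one way to fix that
p_flipped <- p + coord_flip()
p_flipped
```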

Part 2 (35 points)

Overview

For this portion of the midterm you will be broken into teams. Each team will be responsible for performing some additional exploratory analysis of the FEC data set. Each team will also be responsible for giving a ~15-20 minute presentation during class on 10/10. Below are your (modified) random group assignments:

Group Student
3 ADAM CICHON
3 JESSYCA FYE
1 CRAIG JENSEN
2 MITCHELL KOBAYASHI
1 MITCHELL MCDANIEL
4 ANNA OMNESS
4 OLIVIA PLATTE
4 TRISTEN SPENCER
2 SAMANTHA TERNES

Presentation Guidelines

Your group’s presentation should be around 15-20 minutes. You should attempt to distribute the talking evenly between group members.

Your group’s presentation will summarize the results of some additional exploratory data analysis of the data set you worked with in Part 1 of the midterm. This analysis can be as simple or as complex as you want. In the next section of this document I will show you how to augment the data set with additional information that will open up opportunities for interesting analyses.

At a minimum, your presentation must include the following:

  1. 1 slide explaining a question you hoped your exploratory analysis would answer
  2. 1 slide describing the code and methods you used to perform your analysis
  3. 1 slide (with a graph!) describing the result
  4. 1 slide that lists what you’ve liked about the class thus far
  5. 1 slide that lists things you would change about the class for the second half of the semester

Going above and beyond the minimum requirements is encouraged. You will probably have more than 5 slides for a 15-20 minute presentation.

Some analysis suggestions in case you can’t think of any:

  • Differences in total funding between Republicans and Democrats
  • Does increased funding correlate with stronger polling results?
  • Do certain states disproportionately fund one party or another?

Augmenting our Data with pollstR

The pollstR R package provides easy access to data from the HuffPost Pollster API. Here’s a description of the data from the API site:

The Pollster API provides programmatic access to the results of tens of thousands of opinion polls we’ve collected since 2004 as well as our estimates of the current opinion on various candidates and topics.

This vignette gives an excellent overview of the package capabilities. The project’s GitHub page also provides a good overview of the package goals and provides some interesting examples.

Below, I provide some code that adds information from the 2016 national primaries to the full data set and to a summarized version of the data set. The column value that is added to the data gives the estimated percent of the vote the candidate would earn in their party’s primary election.

As always, please install the pollstR package prior to using it.

library(pollstR)
library(dplyr)  # provides filter() and inner_join() used below

# getting data for the national primaries
rp <- pollstr_chart("2016-national-gop-primary")
dp <- pollstr_chart("2016-national-democratic-primary")

# extract the overall estimates from each data set and combine them
nat <- rbind(rp[['estimates']], dp[['estimates']])
# drop rows that have no candidate first name
nat <- filter(nat, !is.na(first_name))

# add this information to our full data set
# (the candidate's last name is the text before the comma in cand_nm)
dat$last_name <- sapply(strsplit(dat$cand_nm, split = ','), '[', 1)
dat_aug <- inner_join(dat, nat, by = 'last_name')


# say we only wanted to look at total gifts for each candidate, we would 
# do something like this
gift_totals <- dat %>% 
  group_by(cand_nm) %>% 
  summarise(total_gifts = sum(contb_receipt_amt))

gift_totals$last_name <- sapply(strsplit(gift_totals$cand_nm, split = ','), '[', 1)

mm <- inner_join(gift_totals, nat, by = 'last_name')
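
To see what the last-name extraction and inner_join steps above do without calling the API, here is a small sketch on made-up tables. The data frames gifts and polls and every value in them are invented. inner_join keeps only the rows whose key appears in both tables, so candidates with no polling estimate drop out.

```r
library(dplyr)

# Made-up gift totals and polling estimates
gifts <- data.frame(
  cand_nm     = c("Doe, Jane", "Roe, Richard", "Poe, Edgar"),
  total_gifts = c(100, 250, 75),
  stringsAsFactors = FALSE
)
polls <- data.frame(
  last_name = c("Doe", "Roe"),
  value     = c(41.5, 12.0),
  stringsAsFactors = FALSE
)

# strsplit breaks each name at the comma; "[" with 1 keeps the first piece
gifts$last_name <- sapply(strsplit(gifts$cand_nm, split = ","), "[", 1)

joined    <- inner_join(gifts, polls, by = "last_name")
unmatched <- anti_join(gifts, polls, by = "last_name")

joined     # Doe and Roe, each with a polling value
unmatched  # Poe has no polling row, so the join dropped it
```

Checking the unmatched rows with anti_join is a quick way to confirm that the join did not silently discard candidates you care about.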